Yesterday we covered how to set up the environment and download the dataset; today we'll look at how the text gets processed.
Since the model can't be trained on raw text directly, the text needs some preprocessing first.
The text has to be converted into numbers first; one way of doing this is described on this website.
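Before using the pretrained converter, here is a toy sketch of the basic idea, just to illustrate what "converting text into numbers" means. The vocabulary and sentence below are made up purely for illustration and have nothing to do with the real tokenizer downloaded in the next step.
# Toy illustration only: map each word of a sentence to an integer ID via a small hand-made vocabulary.
vocab = {'[START]': 2, '[END]': 3, 'but': 4, 'what': 5, 'if': 6, 'it': 7, 'were': 8, 'active': 9, '?': 10}

def toy_tokenize(sentence):
    # Wrap the sentence with start/end IDs, the same convention the real tokenizer uses.
    return [vocab['[START]']] + [vocab[word] for word in sentence.split()] + [vocab['[END]']]

print(toy_tokenize('but what if it were active ?'))
# [2, 4, 5, 6, 7, 8, 9, 10, 3]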
Here we first download the saved tokenizer (converter) model so it can be loaded:
import tensorflow as tf

# Download and unzip the pretrained pt→en tokenizer model into the working directory.
model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(
    f"{model_name}.zip",
    f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
    cache_dir='.', cache_subdir='', extract=True
)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/models/ted_hrlr_translate_pt_en_converter.zip
188416/184801 [==============================] - 0s 0us/step
196608/184801 [===============================] - 0s 0us/step
'./ted_hrlr_translate_pt_en_converter.zip'
Next, have TensorFlow load the downloaded model:
tokenizers = tf.saved_model.load(model_name)
Using dir(), we can see which methods tokenizers makes available:
[item for item in dir(tokenizers.en) if not item.startswith('_')]
['detokenize',
'get_reserved_tokens',
'get_vocab_path',
'get_vocab_size',
'lookup',
'tokenize',
'tokenizer',
'vocab']
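Besides tokenize, detokenize and lookup, which are shown below, a couple of the other methods are handy for inspecting the tokenizer. The values mentioned in the comments are rough assumptions and may differ depending on the model version:
# Size of the English vocabulary used by this tokenizer (a scalar tensor, roughly 7000 entries for this model).
print(tokenizers.en.get_vocab_size())

# The reserved special tokens, e.g. [PAD], [UNK], [START], [END].
print(tokenizers.en.get_reserved_tokens())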
tokenize is the method that converts a string into token IDs.
for en in en_examples.numpy():
    print(en.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .
en_examples holds the first three sentences of the dataset we split up yesterday.
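As a reminder, en_examples comes from yesterday's data loading step. A minimal sketch of how it can be produced, assuming the TED pt→en dataset is loaded through tensorflow_datasets as in the official tutorial, looks roughly like this:
import tensorflow_datasets as tfds

# Load the Portuguese→English TED talks dataset as (pt, en) sentence pairs.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples = examples['train']

# Take one batch of three sentence pairs; en_examples then holds the three English sentences.
for pt_examples, en_examples in train_examples.batch(3).take(1):
    break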
encoded = tokenizers.en.tokenize(en_examples)
for row in encoded.to_list():
    print(row)
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]
Here you can see what the data looks like after conversion: the 2 at the very front marks the start of the sentence and the 3 at the very end marks the end.
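If you want to double-check which tokens those IDs stand for, one quick way is to look at the reserved tokens; that the start and end markers sit at positions 2 and 3 is an assumption based on the output above:
# Reserved tokens are listed in ID order, so indices 2 and 3 should be [START] and [END].
reserved = tokenizers.en.get_reserved_tokens()
print(reserved[2], reserved[3])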
detokenize converts the IDs back into text.
round_trip = tokenizers.en.detokenize(encoded)
for line in round_trip.numpy():
    print(line.decode('utf-8'))
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .
lookup converts token IDs into the corresponding token strings.
tokens = tokenizers.en.lookup(encoded)
tokens
<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]'], [b'[START]', b'but', b'what', b'if', b'it', b'were', b'active', b'?', b'[END]'], [b'[START]', b'but', b'they', b'did', b'n', b"'", b't', b'test', b'for', b'curiosity', b'.', b'[END]']]>
From the tokens above you can see that this method exposes the subword pieces that longer words are split into (the pieces prefixed with ##, such as 'search' + '##ability').
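If you want to stitch these subword tokens back into a readable string without going through detokenize, one rough way is to join them with spaces; using tf.strings.reduce_join here is my own choice, not something the converter requires:
# Join the subword tokens of each sentence into a single space-separated string.
joined = tf.strings.reduce_join(tokens, separator=' ', axis=-1)
for line in joined.numpy():
    print(line.decode('utf-8'))
# e.g. "[START] but what if it were active ? [END]"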